Performance of software using TCP/IP sockets to distribute events to UNIX
workstations is described. This simple software was written at the University of
Mississippi to control the UMiss farm reconstruction of 8 billion raw events, part
of Fermilab E791's data. E791 reconstructed HEP's largest data set to study charm
physics.



Fermilab E791 wrote a large dataset (50 Terabytes, 20 billion events, 24 000
8mm Exabyte tapes) in 1991 and early 1992. Reconstruction challenged
available computing, requiring over 10 mips-years. The task was larger than at
colliders (Table 1). Reconstruction was nevertheless completed using four farm
sites. Here we describe the multiprocessor management software [13] developed
and run at the University of Mississippi farm (Figs. 1 and 2 show hardware).
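
As a rough check on the scale quoted above, the per-event and per-tape averages follow from simple division. The sketch below assumes decimal units (1 Terabyte = 10^12 bytes), which the text does not specify, so the results are order-of-magnitude figures only.

```python
# Dataset totals quoted in the text.
TOTAL_BYTES = 50 * 10**12      # 50 Terabytes (assumed decimal: 10**12 bytes each)
TOTAL_EVENTS = 20 * 10**9      # 20 billion events
TOTAL_TAPES = 24_000           # 8mm Exabyte tapes

bytes_per_event = TOTAL_BYTES / TOTAL_EVENTS        # average raw event size
gbytes_per_tape = TOTAL_BYTES / TOTAL_TAPES / 1e9   # average data per tape
events_per_tape = TOTAL_EVENTS / TOTAL_TAPES        # average events per tape

if __name__ == "__main__":
    print(f"{bytes_per_event:.0f} bytes/event")
    print(f"{gbytes_per_tape:.2f} Gbytes/tape")
    print(f"{events_per_tape:.0f} events/tape")
```

On these assumptions an average event is about 2.5 kbytes and an average tape holds about 2 Gbytes.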

HEP events are usually independent. Interprocess I/O isn't needed. An 
efficient parallel system just has to input and output events fast enough so 
clients are never idle. Management software had to do a lot of hard work in
early HEP systems [9]. Clients had minimal operating systems. All data had to
be formatted in a server and downloaded into clients word by word. Moving 
from single to multiple CPUs was hard; the division between server and 






Figure 1: Mississippi farm overview. Servers are on the four tables. Clients are on the racks 
shown and on desktops not shown. The espresso machine is on the left. 
Figure 2: Mississippi farm architecture. Servers and clients are DECstation 5000
workstations running ULTRIX. Some have MIPS R3000 CPUs; others the more powerful
MIPS R4000. Altogether, there are 68 processors organized into 4 farms isolated by
Ethernet bridges. One typical farm is shown here. The two input tape drives
alternate automatically. Each drive is only wearing itself out half the time. If an
input tape drive fails, the next tape starts automatically. The output is staged
through disk and streamed to tape. If the output tape drive fails, output data can
easily be recovered from disk. If a disk fills, processing is automatically paused
until space appears. This I/O scheme avoids constant operator supervision.



client code was intricate. With the advent of commercial workstation clients with
real operating systems, most work inherent in moving to multiprocessors vanished.
Using Network File System software, server disks can be cross-mounted so that files
are accessible by multiple clients. In this model, even inexpensive diskless
clients directly read an executable code file, a run number file, calibration
files, a raw input record file of events, and write report files and reconstructed
event files. The server writes input events from tape to disk files. At the end of
a job, the server copies client output event files to tape and combines client
reports, as clients work on the next job. Because 85% of E791 events were filtered
away after reconstruction, disk output was fast enough for us. Event input by disk
also worked, but too slowly. So, our multiprocessor manager bypasses disk for
input, using instead Transmission Control Protocol/Internet Protocol (TCP/IP) [1].
With TCP/IP, processes make a connection between themselves and pass data back and
forth using read_from_connection and write_through_connection subroutine calls. A
test of TCP/IP gave 900 kbyte/s, ending client idleness. Fig. 3 illustrates how the
network I/O calls are used. As the server prepares to start a client, it uses
make_socket to "have a phone put in", so that it will be able to connect to the
client. When the client starts, it too uses make_socket to "have a phone put in".
The server "lists its number" by binding its socket to a port (bind_socket), and
"stays near the phone" listening for an attempt to connect (listen_socket). The
client "calls up" the server (connect_socket) and the server "picks up the phone",
establishing the connection (accept_socket). When it needs input data, the client
"places its order" by writing a message to the server (write_socket). The server is
continually monitoring all of the client connections for requests (select_socket).
When a request comes in, the server "writes down the order" (read_socket), and
Table 1: E791 raw data size and p̄p, e⁻p, and e⁺e⁻ collider experiment sizes.
DØ saves digitized waveforms.
Server:
  FOR EVERY CLIENT...
    LISTEN_SOCKET to listen for a call from the client
    ACCEPT_SOCKET to establish a connection to the client
  FOR EVERY INPUT RECORD...
    SELECT_SOCKET to see which client needs service.
    READ_SOCKET to obtain the request message.
    WRITE_SOCKET to send data length info or no_more_data indication.
    SELECT_SOCKET to see which client needs service.
    READ_SOCKET to obtain the request message.
    WRITE_SOCKET to send the data
  CLOSE_SOCKET to make sure the connection is broken
  WRITE REPORT and EXIT

Client:
  MAKE_SOCKET to connect with the server
  CONNECT_SOCKET to call the server
  WRITE_SOCKET to ask for length of next data record
  READ_SOCKET to read the data length
  WRITE_SOCKET to ask for the data. If no_more_data...
    CLOSE_SOCKET to break the connection
    WRITE REPORT and EXIT
  READ_SOCKET to read the data.
  PROCESS THE DATA and write the output to disk.

Figure 3: TCP/IP socket communication. 



does its best to satisfy the client's request. The client and server shuttle
messages back and forth (each writing to the other and reading from the other)
until the input data is exhausted. Next, the server notifies the clients that there
is no more data. The clients then finish their tasks, close their connections
(close_socket), and exit; the server finishes its tasks and exits. The
Fortran-callable C routines that manipulate sockets and connections really are that
simple to use. Only read_socket is more than a C to Fortran interface, and even it
is trivial. Most of the real work has already been done in UNIX, TCP/IP, and
Berkeley Sockets.
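
The request and reply exchange described above can be sketched with Python's standard socket and select modules, which wrap the same Berkeley socket calls. This is an illustration only, not the E791 Fortran-callable C routines. The one-byte request codes (L for length, D for data), the 4-byte length header, the zero-length no_more_data sentinel, and all function names are assumptions made for the example; clients run as threads on the local host purely so the sketch is self-contained.

```python
import select
import socket
import struct
import threading

NO_MORE_DATA = 0  # assumed sentinel: a zero length means "no more data"


def recv_exact(conn, nbytes):
    """Read exactly nbytes; TCP may deliver one message in several pieces."""
    chunks = []
    while nbytes > 0:
        chunk = conn.recv(nbytes)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        nbytes -= len(chunk)
    return b"".join(chunks)


def serve(listener, records, n_clients):
    """Hand out records on request, then tell each client there is no more data."""
    conns = [listener.accept()[0] for _ in range(n_clients)]  # "pick up the phone"
    pending = list(records)
    queued = {}  # connection -> record whose length has already been announced
    while conns:
        ready, _, _ = select.select(conns, [], [])            # who needs service?
        for conn in ready:
            request = recv_exact(conn, 1)                     # "write down the order"
            if request == b"L" and pending:
                queued[conn] = pending.pop(0)
                conn.sendall(struct.pack("!I", len(queued[conn])))
            elif request == b"L":
                conn.sendall(struct.pack("!I", NO_MORE_DATA))
                conn.close()
                conns.remove(conn)
            elif request == b"D":
                conn.sendall(queued.pop(conn))
    listener.close()


def client(port, results):
    """Ask for a length, then the data, until the server says no_more_data."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # "have a phone put in"
    sock.connect(("127.0.0.1", port))                         # "call up" the server
    while True:
        sock.sendall(b"L")
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        if length == NO_MORE_DATA:
            break
        sock.sendall(b"D")
        results.append(recv_exact(sock, length))              # stand-in for processing
    sock.close()


def run_demo(n_clients=3, n_records=10):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # "list the number"; port 0 lets the OS choose
    listener.listen(n_clients)        # "stay near the phone"
    port = listener.getsockname()[1]
    records = [b"event-%d" % i for i in range(n_records)]
    results = []
    threads = [threading.Thread(target=serve, args=(listener, records, n_clients),
                                daemon=True)]
    threads += [threading.Thread(target=client, args=(port, results), daemon=True)
                for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return records, results


if __name__ == "__main__":
    sent, received = run_demo()
    print(len(received), "of", len(sent), "records delivered")
```

Because each client asks for work only when it is ready, faster clients naturally receive more records, and no separate scheduling logic is needed.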

Although most of our farm processors are in racks, some are on people's 
desks. We have found it satisfactory to allow users to abruptly kill the client 
process whenever they find its activities on their workstations to be trouble- 
some. A reconstruction code crash also kills a client. In either case, the server 
is quickly aware of the dead client and adjusts event distribution. A disadvan- 
tage of this approach is that a few input events are trapped and lost. Having 
20 billion events, we take a rather cavalier attitude. In E791, processing raw 
events is rather like hauling corn to market in a truck. If a few grains of corn 
fall out of the truck, no one cares. The alternative approach - treating events 
as babies in a hospital nursery, where one normally expects a somewhat stricter 
accounting - only makes sense later with small selections of interesting events. 
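
One way a server can notice a killed client is that the dead client's connection becomes readable and a read returns no bytes (or raises a reset error). The sketch below shows that check with Python's standard library; it is an assumption about the mechanism, since the text does not say how the E791 server detects dead clients. The function name and the use of socketpair to simulate a client are illustrative only.

```python
import select
import socket


def drop_dead_clients(conns, timeout=1.0):
    """Return (alive, dead); connections whose peer has gone away are closed."""
    ready, _, _ = select.select(conns, [], [], timeout)
    dead = []
    for conn in ready:
        try:
            # MSG_PEEK inspects pending input without consuming a real request.
            gone = conn.recv(1, socket.MSG_PEEK) == b""
        except ConnectionError:
            gone = True
        if gone:
            conn.close()
            dead.append(conn)
    alive = [c for c in conns if c not in dead]
    return alive, dead


def demo():
    server_a, client_a = socket.socketpair()
    server_b, client_b = socket.socketpair()
    client_a.close()                  # simulates a user killing the client process
    alive, dead = drop_dead_clients([server_a, server_b])
    outcome = (alive == [server_b], dead == [server_a])
    server_b.close()
    client_b.close()
    return outcome


if __name__ == "__main__":
    print(demo())
```

In this sketch any record already handed to the dead client is simply not re-sent, which corresponds to the few lost events accepted above.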
Before writing our own multiprocessor software we considered extracting
the few features needed from large packages under development at Fermilab [3]
and Argonne. However, offsite support was unavailable. So we focused on
writing software which could do a limited number of things very well; e.g. run
many clients per server efficiently and tolerate client crashes and operating
system upgrades. Six man-weeks were spent coding. Farm operation required
5 hours over a day. After seeing our approach to moving farm data, the DØ
experiment decided to follow a similar strategy.

Funding in June 1993 allowed an expansion of the UMiss farm from 1100 to 
2900 mips. By July 1993, the increased computing was acquired and processing 
data. E791 reconstruction was completed in Sept. 1994. A total of 8 billion 
events on 10 000 raw data tapes were processed in Mississippi. Before running 
final reconstruction, dozens of full farm tests of algorithms for actual charm 
yield were run, each test for a few days. The charm yield tripled. X Window 
operator control displays written in Tcl/Tk aided bookkeeping. Tape reading 
was multiply buffered, so that events were almost always available immediately 
when a client asked for them. During smooth running, timing CPUs showed 
that at least 97% of client processing cycles were used. Overall efficiency, 
considering cycles lost for any reason, exceeded 90% over a 2½ year period.
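
The effect of multiply buffered tape reading can be illustrated with a bounded queue filled by a reader thread, so that a record is usually waiting when a client asks for one. This is a generic producer and consumer sketch, not the E791 implementation, whose buffering scheme is not described here. The names prefetch and buffered_records, the buffer depth, and the read_record callable (which returns None at end of input) are all assumptions.

```python
import queue
import threading

END_OF_INPUT = object()  # marker placed in the buffer when the source runs dry


def prefetch(read_record, buffer):
    """Reader thread: keep the buffer topped up until read_record returns None."""
    while True:
        record = read_record()
        if record is None:
            buffer.put(END_OF_INPUT)
            return
        buffer.put(record)  # blocks while the buffer is full


def buffered_records(read_record, depth=4):
    """Yield records in order while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=depth)
    reader = threading.Thread(target=prefetch, args=(read_record, buffer),
                              daemon=True)
    reader.start()
    while True:
        record = buffer.get()
        if record is END_OF_INPUT:
            return
        yield record


if __name__ == "__main__":
    source = iter([b"event-0", b"event-1", b"event-2"])
    for rec in buffered_records(lambda: next(source, None)):
        print(rec)
```

The bounded depth keeps memory use fixed: the reader stalls when the buffer is full and resumes as soon as a record is handed out.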

Efficient management of multiple processors has led to the reconstruction
of 200 000 charm particles, the world's largest sample. Results include
DPF '96 papers by N. Copty, L. Cremaldi, K. Gounder, M. Purohit, K.C. Peng, 
A. Tripathi, R. Zaliznyak, and C. Zhang. We especially thank Lucien Cremaldi 
and Breese Quinn for their contributions to building and running the UMiss 
farm. This work was supported in part by U.S. DOE DE-FG05-91ER40622. 